Ensure you have completed any necessary setup instructions before proceeding.
# Install BiocManager
install.packages("BiocManager", quiet = TRUE)
# Download and read dependencies file
download.file(
"https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/dependencies.csv",
destfile = 'dependencies.csv'
)
deps <- read.table('dependencies.csv') # avoid masking base::table
# Install dependencies using BiocManager
BiocManager::install(deps[[1]])
# Create a directory for data files
dir.create("data", showWarnings = FALSE)
# List of data files to download
data_files <- c(
"cancer_expression.rds",
"coefHorvath.rds",
"methylation.rds",
# "scRNAseq_data.rds", #only to download if there is time for part 6 of lecture
"prostate.rds"
)
# Download data files into the "data" directory
for (file in data_files) {
download.file(
url = file.path(
"https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/Data",
file
),
destfile = file.path("data", file)
)
}
Descriptions of four research questions and their datasets are given below. Which of these scenarios use high-dimensional data?
Load the dataset using the here package. Ensure the
prostate.rds file is located in the data
folder of your project directory.
library("here")
prostate <- readRDS(here("data/prostate.rds"))
# ----------------------------------------------------------------------
# Examine the Dataset
# ----------------------------------------------------------------------
# The dataset represents clinical data where each row corresponds to a single patient.
# Let's first examine the dimensions of the dataset to understand its structure.
# Determine the number of observations (n) and features (p)
dimensions <- dim(prostate) # Returns a vector: [number of rows, number of columns]
print(dimensions) # Print the dimensions
## [1] 97 9
# Examine the variables measured in the dataset
variable_names <- names(prostate) # Returns the column names (variables)
print(variable_names)
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa"
# Preview the first few rows of the dataset
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
# ----------------------------------------------------------------------
# Visualize Relationships Between Variables
# ----------------------------------------------------------------------
# Plot pairwise relationships between variables using the `pairs()` function.
# This function creates a scatterplot matrix to visualize the relationships
# between all pairs of numeric variables.
pairs(prostate)
1). Use the cor() function to examine correlations between all variables in
the prostate dataset. Are some pairs of variables highly
correlated, using a threshold of 0.75 for the correlation coefficients?
2). Use the lm() function to fit univariate regression
models to predict patient age using two variables that are highly
correlated as predictors. Which of these variables are statistically
significant predictors of age? Hint: the summary() function
can help here.
3). Fit a multiple linear regression model predicting patient age using both variables. What happened?
# ----------------------------------------------------------------------
# 1. Examine Correlations Between Variables
# ----------------------------------------------------------------------
# Use the cor() function to compute pairwise correlations between all variables
# in the prostate dataset. Identify pairs of variables with a correlation
# coefficient greater than 0.75 (absolute value).
# Compute the correlation matrix for the dataset
cor(prostate)
## lcavol lweight age lbph svi lcp
## lcavol 1.0000000 0.194128286 0.2249999 0.027349703 0.53884500 0.675310484
## lweight 0.1941283 1.000000000 0.3075286 0.434934636 0.10877851 0.100237795
## age 0.2249999 0.307528614 1.0000000 0.350185896 0.11765804 0.127667752
## lbph 0.0273497 0.434934636 0.3501859 1.000000000 -0.08584324 -0.006999431
## svi 0.5388450 0.108778505 0.1176580 -0.085843238 1.00000000 0.673111185
## lcp 0.6753105 0.100237795 0.1276678 -0.006999431 0.67311118 1.000000000
## gleason 0.4324171 -0.001275658 0.2688916 0.077820447 0.32041222 0.514830063
## pgg45 0.4336522 0.050846821 0.2761124 0.078460018 0.45764762 0.631528245
## lpsa 0.7344603 0.354120390 0.1695928 0.179809410 0.56621822 0.548813169
## gleason pgg45 lpsa
## lcavol 0.432417056 0.43365225 0.7344603
## lweight -0.001275658 0.05084682 0.3541204
## age 0.268891599 0.27611245 0.1695928
## lbph 0.077820447 0.07846002 0.1798094
## svi 0.320412221 0.45764762 0.5662182
## lcp 0.514830063 0.63152825 0.5488132
## gleason 1.000000000 0.75190451 0.3689868
## pgg45 0.751904512 1.00000000 0.4223159
## lpsa 0.368986803 0.42231586 1.0000000
# The same matrix rounded to two decimal places is easier to scan:
round(cor(prostate), 2)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## lcavol 1.00 0.19 0.22 0.03 0.54 0.68 0.43 0.43 0.73
## lweight 0.19 1.00 0.31 0.43 0.11 0.10 0.00 0.05 0.35
## age 0.22 0.31 1.00 0.35 0.12 0.13 0.27 0.28 0.17
## lbph 0.03 0.43 0.35 1.00 -0.09 -0.01 0.08 0.08 0.18
## svi 0.54 0.11 0.12 -0.09 1.00 0.67 0.32 0.46 0.57
## lcp 0.68 0.10 0.13 -0.01 0.67 1.00 0.51 0.63 0.55
## gleason 0.43 0.00 0.27 0.08 0.32 0.51 1.00 0.75 0.37
## pgg45 0.43 0.05 0.28 0.08 0.46 0.63 0.75 1.00 0.42
## lpsa 0.73 0.35 0.17 0.18 0.57 0.55 0.37 0.42 1.00
# As seen above, some variables are highly correlated.
# In particular, the correlation between gleason and pgg45 is 0.752,
# which exceeds our threshold of 0.75.
cor_matrix <- cor(prostate, use = "complete.obs")
high_correlations <- which(abs(cor_matrix) > 0.75 & abs(cor_matrix) < 1, arr.ind = TRUE)
# Extract and display the highly correlated variable pairs
high_cor_pairs <- tibble::tibble(
Var1 = rownames(cor_matrix)[high_correlations[, 1]],
Var2 = colnames(cor_matrix)[high_correlations[, 2]],
Correlation = cor_matrix[high_correlations]
)
print(high_cor_pairs)
## # A tibble: 2 × 3
## Var1 Var2 Correlation
## <chr> <chr> <dbl>
## 1 pgg45 gleason 0.752
## 2 gleason pgg45 0.752
# ----------------------------------------------------------------------
# 2. Fit Univariate Regression Models
# ----------------------------------------------------------------------
# Fitting univariate regression models to predict age using `gleason` and `pgg45` as predictors.
# Use lm() to fit univariate regression models predicting patient age
# using two highly correlated variables. Evaluate statistical significance
# of each predictor using summary().
model_gleason <- lm(age ~ gleason, data = prostate)
model_pgg45 <- lm(age ~ pgg45, data = prostate)
# Summarize results for each model
summary(model_gleason) # Check significance of gleason
##
## Call:
## lm(formula = age ~ gleason, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.780 -3.552 1.448 4.220 13.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.146 6.918 6.525 3.29e-09 ***
## gleason 2.772 1.019 2.721 0.00774 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.209 on 95 degrees of freedom
## Multiple R-squared: 0.0723, Adjusted R-squared: 0.06254
## F-statistic: 7.404 on 1 and 95 DF, p-value: 0.007741
summary(model_pgg45) # Check significance of pgg45
##
## Call:
## lm(formula = age ~ pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.0889 -3.4533 0.9111 4.4534 15.1822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.08890 0.96758 64.17 < 2e-16 ***
## pgg45 0.07289 0.02603 2.80 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.193 on 95 degrees of freedom
## Multiple R-squared: 0.07624, Adjusted R-squared: 0.06651
## F-statistic: 7.84 on 1 and 95 DF, p-value: 0.006189
# ----------------------------------------------------------------------
# 3. Fit a Multiple Regression Model
# ----------------------------------------------------------------------
# Fit a multiple regression model predicting patient age using both variables.
# Examine what happens when both correlated variables are included.
# Fit a multiple regression model
model_multivar <- lm(age ~ gleason + pgg45, data = prostate)
# Summarize the multiple regression model
summary(model_multivar)
##
## Call:
## lm(formula = age ~ gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.927 -3.677 1.323 4.323 14.420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.95548 9.74316 5.435 4.3e-07 ***
## gleason 1.45363 1.54299 0.942 0.349
## pgg45 0.04490 0.03951 1.137 0.259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.198 on 94 degrees of freedom
## Multiple R-squared: 0.08488, Adjusted R-squared: 0.06541
## F-statistic: 4.359 on 2 and 94 DF, p-value: 0.01547
# This workshop focuses on statistical methods for visualizing and analyzing
# high-dimensional biological data using Bioconductor packages.
# Bioconductor is an open-source platform designed for analyzing high-throughput genomic data.
# It provides a variety of useful packages and example datasets.
# More details and resources are available at: https://www.bioconductor.org/
# Bioconductor packages can be installed and managed using the BiocManager package.
# Let's load the "minfi" package, a Bioconductor package specifically designed
# for analyzing Illumina Infinium DNA methylation arrays.
# BiocManager::install("minfi")
library("minfi")
browseVignettes("minfi")
library("here")
# methylation <- readRDS("methylation.rds")
methylation <- readRDS(here("data/methylation.rds"))
head(colData(methylation))
## DataFrame with 6 rows and 14 columns
## Sample_Well Sample_Name purity Sex Age
## <character> <character> <integer> <character> <integer>
## 201868500150_R01C01 A07 PCA0612 94 M 39
## 201868500150_R03C01 C07 NKpan2510 95 M 49
## 201868500150_R05C01 E07 WB1148 95 M 20
## 201868500150_R07C01 G07 B0044 97 M 49
## 201868500150_R08C01 H07 NKpan1869 95 F 33
## 201868590193_R02C01 B03 NKpan1850 93 F 21
## weight_kg height_m bmi bmi_clas Ethnicity_wide
## <numeric> <numeric> <numeric> <character> <character>
## 201868500150_R01C01 88.4505 1.8542 25.7269 Overweight Mixed
## 201868500150_R03C01 81.1930 1.6764 28.8911 Overweight Indo-European
## 201868500150_R05C01 80.2858 1.7526 26.1381 Overweight Indo-European
## 201868500150_R07C01 82.5538 1.7272 27.6727 Overweight Indo-European
## 201868500150_R08C01 87.5433 1.7272 29.3452 Overweight Indo-European
## 201868590193_R02C01 87.5433 1.6764 31.1507 Obese Mixed
## Ethnic_self smoker Array Slide
## <character> <character> <character> <numeric>
## 201868500150_R01C01 Hispanic No R01C01 2.01869e+11
## 201868500150_R03C01 Caucasian No R03C01 2.01869e+11
## 201868500150_R05C01 Persian No R05C01 2.01869e+11
## 201868500150_R07C01 Caucasian No R07C01 2.01869e+11
## 201868500150_R08C01 Caucasian No R08C01 2.01869e+11
## 201868590193_R02C01 Finnish/Creole No R02C01 2.01869e+11
methyl_mat <- t(assay(methylation))
## calculate correlations between cells in matrix
cor_mat <- cor(methyl_mat)
cor_mat[1:10, 1:10] # print the top-left corner of the correlation matrix
## cg00075967 cg00374717 cg00864867 cg00945507 cg01027739
## cg00075967 1.00000000 -0.54187597 -0.23545127 -0.03328856 0.15814866
## cg00374717 -0.54187597 1.00000000 0.20194482 -0.01824105 -0.11722829
## cg00864867 -0.23545127 0.20194482 1.00000000 -0.07989013 -0.25388089
## cg00945507 -0.03328856 -0.01824105 -0.07989013 1.00000000 0.32991673
## cg01027739 0.15814866 -0.11722829 -0.25388089 0.32991673 1.00000000
## cg01353448 0.55187311 -0.37130021 0.02410416 0.07201148 0.08380460
## cg01584473 -0.52114045 0.30207176 0.49376101 -0.06213370 -0.04972067
## cg01644850 -0.08130044 0.08972931 0.07465884 0.21406721 0.35845482
## cg01656216 -0.27776967 -0.33015779 0.20665253 -0.14690180 -0.49462321
## cg01873645 0.18174953 -0.10881732 -0.02333667 0.17384071 0.18106477
## cg01353448 cg01584473 cg01644850 cg01656216 cg01873645
## cg00075967 0.551873113 -0.52114045 -0.081300436 -0.277769670 0.18174953
## cg00374717 -0.371300212 0.30207176 0.089729312 -0.330157788 -0.10881732
## cg00864867 0.024104157 0.49376101 0.074658838 0.206652529 -0.02333667
## cg00945507 0.072011477 -0.06213370 0.214067207 -0.146901796 0.17384071
## cg01027739 0.083804601 -0.04972067 0.358454820 -0.494623205 0.18106477
## cg01353448 1.000000000 -0.24710600 -0.001904264 -0.003425993 0.07776795
## cg01584473 -0.247106001 1.00000000 0.013019805 0.263537096 -0.01537876
## cg01644850 -0.001904264 0.01301980 1.000000000 -0.331328760 0.19820041
## cg01656216 -0.003425993 0.26353710 -0.331328760 1.000000000 -0.24664137
## cg01873645 0.077767947 -0.01537876 0.198200410 -0.246641372 1.00000000
library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
# Check the dimensions of the dataset using dim().
# Note: In computational biology data structures in R, observations are stored as columns
# and features (e.g., genomic sites) are stored as rows.
# This contrasts with typical tabular data where features are columns and observations are rows.
# Access assay data (e.g., normalised methylation levels) using assay().
# Retrieve sample-level information using colData().
dim(methylation)
## [1] 5000 37
# The output shows that the object has dimensions of 5000 × 37,
# indicating 5000 features and 37 observations.
# To extract the matrix of methylation M-values, use the assay() function.
methyl_mat <- assay(methylation)
hist(methyl_mat, xlab = "M-value")
# Association between age and DNA methylation:
# The following heatmap summarises age and methylation levels available in the methylation dataset:
library("ComplexHeatmap")
age <- methylation$Age
# sort methylation values by age
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]
# plot heatmap
Heatmap(methyl_mat_ord,
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
name = "M-value",
row_title = "Feature",
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
Why can’t we simply fit many linear regression models for every combination of features (colData and assays) and draw conclusions based on significant p-values?
There are several problems with this approach: with thousands of features we perform thousands of tests, so some small p-values will appear purely by chance, and because many features are highly correlated the individual models are not independent of one another.
# ----------------------------------------------------------------------
# CHALLENGE 1
# ----------------------------------------------------------------------
# Arbitrarily select the first CpG in the methyl_mat matrix (the one on its first row):
age <- methylation$Age
# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
##
## Call:
## lm(formula = methyl_mat[1, ] ~ age)
##
## Coefficients:
## (Intercept) age
## 0.902334 0.008911
# We now have estimates for the expected methylation level when age equals 0 (the intercept)
# and the change in methylation level for a unit change in age (the slope).
# We could plot this linear model:
plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16)
abline(lm_age_methyl1)
# For this linear model, we can use tidy() from the broom package to extract detailed
# information about the coefficients and the associated hypothesis tests in this model:
library("broom")
tidy(lm_age_methyl1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.902 0.344 2.62 0.0129
## 2 age 0.00891 0.0100 0.888 0.381
# ----------------------------------------------------------------------
# CHALLENGE 2
# ----------------------------------------------------------------------
# SOLUTION:
# The first coefficient, the intercept, represents the mean methylation value
# for the first CpG when age is zero (estimated as 0.902).
# However, this is not meaningful since there are no observations with age zero
# or below 20 in the dataset.
# The p-value tests whether the intercept (β₀) is zero,
# but this is not relevant since methylation levels at age zero are not of interest.
# The key focus is on the regression coefficient for age,
# as it shows whether there is a linear relationship between age and methylation.
# What is a model matrix?
design_age <- model.matrix(lm_age_methyl1) # model matrix
head(design_age)
## (Intercept) age
## 201868500150_R01C01 1 39
## 201868500150_R03C01 1 49
## 201868500150_R05C01 1 20
## 201868500150_R07C01 1 49
## 201868500150_R08C01 1 33
## 201868590193_R02C01 1 21
dim(design_age)
## [1] 37 2
# The model matrix has the same number of rows as the methylation data has samples.
# It has two columns: one for the intercept (like in the linear model we fit earlier)
# and one for age.
# When using lm(), this step happens automatically, but here we specify the model matrix directly.
# For more complex experimental designs, the limma user manual provides guidance on creating model matrices.
# In this case, we use a simple two-variable model.
# We use the lmFit() function to fit the model, passing the methylation data and model matrix.
# Internally, lmFit() efficiently runs lm() for each row of the data.
# The eBayes() function, applied to the lmFit() output, performs pooled standard error estimation,
# resulting in moderated t-statistics and p-values.
library("limma")
design_age <- model.matrix(lm_age_methyl1) # model matrix
fit_age <- lmFit(methyl_mat, design = design_age)
fit_age <- eBayes(fit_age)
# Use the topTable() function to obtain the results of the linear models.
# By default, topTable() returns results for the first coefficient in the model.
# The first coefficient corresponds to the intercept term, which is not of interest here.
# Specify coef = 2 to focus on the second coefficient (age in this case).
# By default, topTable() returns only the top 10 results.
# To see all results, set number = nrow(fit_age), ensuring a row for every input row.
toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age))
head(toptab_age)
## logFC AveExpr t P.Value adj.P.Val B
## cg08446924 -0.02571353 -0.4185868 -6.039068 5.595675e-07 0.002797837 5.131574
## cg06493994 0.01550941 -2.1057877 5.593988 2.239813e-06 0.005599533 3.747986
## cg17661642 0.02266668 -2.0527722 5.358739 4.658336e-06 0.006048733 3.019698
## cg05168977 0.02276336 -2.2918472 5.346500 4.838987e-06 0.006048733 2.981904
## cg24549277 0.01975577 -1.7466088 4.939242 1.708355e-05 0.011508818 1.731821
## cg04436528 -0.01943612 0.7033503 -4.917179 1.828563e-05 0.011508818 1.664608
# The output of topTable includes several columns:
# - logFC: The coefficient estimate (log fold change), representing effect size.
# - AveExpr: The average expression level of the feature across samples.
# - t: The t-statistic for the coefficient.
# - P.Value: The p-value for the test.
# - adj.P.Val: The adjusted p-value (we'll discuss adjusted p-values shortly).
# - B: The log-odds that a feature is significantly different (a transformation of the p-value, not covered here).
# The term logFC is used for historical reasons from microarray experiments.
# These results provide effect sizes and p-values for the association between methylation levels
# at each locus and age across the 37 samples.
# To visualise effect sizes (coefficients) and statistical significance (p-values),
# we can plot effect sizes against p-values for all linear models.
# Such plots are called "volcano plots" because their shape resembles an eruption.
plot(toptab_age$logFC, -log10(toptab_age$P.Value),
xlab = "Effect size", ylab = bquote(-log[10](p-value)),
pch = 19
)
# In this figure:
# - Each point represents a feature of interest.
# - The x-axis shows the effect size from a linear model.
# - The y-axis shows −log10(p-value), where higher values indicate stronger statistical evidence
# of a non-zero effect size.
# - Positive effect sizes indicate increasing methylation with age.
# - Negative effect sizes indicate decreasing methylation with age.
# - Points higher on the y-axis represent features with results unlikely under the null hypothesis.
# The goal is to identify features with different methylation levels across age groups.
# Ideally, there would be a clear separation between “null” (no effect) and “non-null” (effect exists) features.
# However, in practice, we often see a continuum of effect sizes and p-values without clear separation.
# Statistical methods can provide insights from these continuous measures,
# but it is often useful to generate a list of features with confident non-zero effect sizes.
# This is challenging due to the large number of tests performed.
Try fitting a linear model using smoking status as a covariate
instead of age, and create a volcano plot. Note: Smoking status is
stored as methylation$smoker.
In the lecture example, we saw that information sharing can result in larger p-values. Why might this be preferable?
# ----------------------------------------------------------------------
# CHALLENGE 3
# ----------------------------------------------------------------------
# 1.
design_smoke <- model.matrix(~methylation$smoker)
fit_smoke <- lmFit(methyl_mat, design = design_smoke)
fit_smoke <- eBayes(fit_smoke)
toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke))
plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value),
xlab = "Effect size", ylab = bquote(-log[10](p)),
pch = 19
)
# ----------------------------------------------------------------------
# With a large number of features, it is important to determine which features are "interesting"
# or "significant" for further study.
# Using a standard significance threshold of 0.05 may lead to many false positives.
# A p-value threshold of 0.05 means there is a 1 in 20 chance of observing results
# as extreme or more extreme under the null hypothesis (no association between age and methylation).
# If we perform many more than 20 tests, we are likely to observe significant p-values
# purely due to random chance, even when the null hypothesis is true.
# To illustrate this, we can permute (scramble) the age values and rerun the test
# to see how random chance affects the results.
set.seed(123)
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)
fit_age_perm <- lmFit(methyl_mat, design = design_age_perm)
fit_age_perm <- eBayes(fit_age_perm)
toptab_age_perm <- topTable(fit_age_perm, coef = 2, number = nrow(fit_age_perm))
plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
xlab = "Effect size", ylab = bquote(-log[10](p)),
pch = 19
)
abline(h = -log10(0.05), lty = "dashed", col = "red")
A random sequence of ages was generated, so no true association between methylation levels and this random sequence is expected. Despite this, many features still show p-values below the traditional significance threshold of p = 0.05. In this example, 226 features are significant at p < 0.05, which demonstrates how using this threshold in a real experiment could falsely identify features as associated with age purely by chance.
When performing multiple tests, features are classified as either “significant” or “non-significant,” but some classifications will inevitably be incorrect. Results fall into four categories: true positives, false positives, true negatives, and false negatives.
At a significance level of 5%, 5% of results will be false positives (false discoveries) by chance, as p-values are uniformly distributed under the null hypothesis.
To control false discoveries, the Bonferroni correction can be applied:
Divide the significance threshold by the number of tests (n). Alternatively, multiply the p-values by the number of tests. However, Bonferroni is very conservative, especially with a large number of features.
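To see that dividing the threshold and multiplying the p-values are equivalent, here is a quick sketch with simulated p-values (the names p_sim and n are illustrative, not from the lesson):

```r
set.seed(1)
p_sim <- runif(20)  # simulated p-values; uniform under the null hypothesis
n <- length(p_sim)

# Multiplying the p-values by n (capped at 1) reproduces p.adjust()'s
# Bonferroni method exactly
all.equal(p.adjust(p_sim, method = "bonferroni"), pmin(p_sim * n, 1))  # TRUE

# Comparing raw p-values to 0.05 / n makes the same calls as comparing
# Bonferroni-adjusted p-values to 0.05
identical(p_sim < 0.05 / n, pmin(p_sim * n, 1) < 0.05)  # TRUE
```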
p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
plot(p_raw, p_fwer, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
A = threshold = 0.05 / 100.
# ----------------------------------------------------------------------
# CHALLENGE 4
# ----------------------------------------------------------------------
# 2. We can’t say what proportion of these genes are truly different. However, if we repeated this
# experiment and statistical test over and over, on average 5% of the results from each run
# would be false discoveries.
# Try applying FDR correction to the p_raw vector.
# Hint: Use the p.adjust() function and check help("p.adjust") for details on the method.
# A = The following code runs FDR correction and compares it to non-corrected values and to Bonferroni:
p_fdr <- p.adjust(p_raw, method = "BH")
plot(p_raw, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
plot(p_fwer, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
# ----------------------------------------------------------------------
# REGULARISATION
# ----------------------------------------------------------------------
library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
## here, we transpose the matrix to have features as rows and samples as columns
methyl_mat <- t(assay(methylation))
age <- methylation$Age
Then, we try to fit a model with outcome age and all 5,000 features in this dataset as predictors (average methylation levels, M-values, across different sites in the genome).
# by using methyl_mat in the formula below, R will run a multivariate regression
# model in which each of the columns in methyl_mat is used as a predictor.
fit <- lm(age ~ methyl_mat)
You can see that we’re able to get some effect size estimates, but they seem very high! The summary also says that we were unable to estimate effect sizes for 4,964 features because of “singularities”. We clarify what singularities are in the note below, but this means that R couldn’t find a way to perform the calculations necessary to fit the model. Large effect sizes and singularities are common when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data), often because the model cannot distinguish between the effects of many correlated features, or because we have more features than observations.
A singularity occurs when a matrix used in a linear model cannot be inverted, which is necessary for calculating the model’s coefficients. This happens if there are more features (predictors) than observations, or if the features are highly correlated. In such cases, R cannot perform the required calculations and reports singularities.
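A small simulated example makes this concrete (the toy objects X, y, and fit_toy below are illustrative, not part of the lesson data): with more predictors than observations, lm() can only estimate as many coefficients as there are observations and returns NA for the rest.

```r
set.seed(1)
n_obs  <- 10
n_feat <- 20
X <- matrix(rnorm(n_obs * n_feat), nrow = n_obs)  # more features than observations
y <- rnorm(n_obs)

fit_toy <- lm(y ~ X)
# The design matrix (intercept + 20 features = 21 columns) has rank at most 10,
# so 11 coefficients cannot be estimated and are reported as NA ("singular"):
sum(is.na(coef(fit_toy)))  # 11
```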
Consider or discuss in groups: 1. Why would we observe correlated features in high-dimensional biological data? 2. Why might correlated features be a problem when fitting linear models? 3. What issue might correlated features present when selecting features to include in a model one at a time?
Many of the features in biological data represent very similar information biologically. For example, sets of genes that form complexes are often expressed in very similar quantities. Similarly, methylation levels at nearby sites are often very highly correlated.
Correlated features can make inference unstable or even impossible mathematically.
When we are selecting features one at a time we want to pick the most predictive feature each time. When a lot of features are very similar but encode slightly different information, which of the correlated features we select to include can have a huge impact on the later stages of model selection!
Sets of models are often compared using statistics such as adjusted R2, AIC or BIC. These show us how well the model is learning the data used in fitting that same model 1. However, these statistics do not really tell us how well the model will generalise to new data. This is an important thing to consider – if our model doesn’t generalise to new data, then there’s a chance that it’s just picking up on a technical or batch effect in our data, or simply some noise that happens to fit the outcome we’re modelling. This is especially important when our goal is prediction – it’s not much good if we can only predict well for samples where the outcome is already known, after all!
To get an idea of how well our model generalises, we can split the data into two - a “training” and a “test” set. We use the “training” data to fit the model, and then see its performance on the “test” data.
One thing that often happens in this context is that large coefficient values minimise the training error, but they don’t minimise the test error on unseen data. First, we’ll go through an example of what exactly this means.
To compare the training and test errors for a model of methylation features and age, we’ll split the data into training and test sets, fit a linear model and calculate the errors. First, let’s calculate the training error. Let’s start by splitting the data into training and test sets:
methylation <- readRDS(here::here("data/methylation.rds"))
library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))
Subset significant CpGs:
cpg_markers <- c("cg16241714", "cg14424579", "cg22736354", "cg02479575", "cg00864867",
"cg25505610", "cg06493994", "cg04528819", "cg26297688", "cg20692569",
"cg04084157", "cg22920873", "cg10281002", "cg21378206", "cg26005082",
"cg12946225", "cg25771195", "cg26845300", "cg06144905", "cg27377450"
)
horvath_mat <- methyl_mat[, cpg_markers]
## Generate an index to split the data
set.seed(42)
train_ind <- sample(nrow(methyl_mat), 25)
## Split the data
train_mat <- horvath_mat[train_ind, ]
train_age <- age[train_ind]
test_mat <- horvath_mat[-train_ind, ]
test_age <- age[-train_ind]
Now let’s fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or “mean squared error” (MSE).
## Fit a linear model
# as.data.frame() converts train_mat into a data.frame
# Using the `.` syntax above together with a `data` argument will lead to
# the same result as using `train_age ~ train_mat`: R will fit a multivariate
# regression model in which each of the columns in `train_mat` is used as
# a predictor. We opted to use the `.` syntax because it will help us to
# obtain model predictions using the `predict()` function.
fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
## Function to calculate the (mean squared) error
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
## Calculate the training error
err_lm_train <- mse(train_age, fitted(fit_horvath))
err_lm_train
## [1] 1.319628
Next, let’s find the predicted values for the ‘unseen’ test data:
The mean squared error for the test set is the mean of the squared error between the predicted values and true test data.
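The chunk producing this test error is not shown above; a sketch consistent with the surrounding code, reusing `fit_horvath`, `test_mat`, `test_age`, and the `mse()` helper defined earlier (`pred_lm` is used again later when plotting):

```r
## Predict ages for the unseen test data from the fitted linear model
pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat))
## Calculate the mean squared error on the test set
err_lm <- mse(test_age, pred_lm)
err_lm
```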
## [1] 223.3571
Unfortunately, the test error is a lot higher than the training error. If we plot true age against predicted age for the samples in the test set, we can gain more insight into the performance of the model on the test set. Ideally, the predicted values should be close to the test data.
Ridge regression adds a penalty to the sum of squared coefficients (L2 norm) to control model complexity. This penalty, scaled by a factor λ, balances reducing large coefficients and fitting the data.
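Written out, ridge regression minimises the penalised sum of squares (using the same notation as the elastic net objective later in this episode):

\[ \sum_{i=1}^N \left( y_i - x'_i\beta \right)^2 + \lambda \|\beta\|_2^2, \qquad \text{where } \|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2 \]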
We’ll use the glmnet package to compare regularised and ordinary least squares models, using the subset of 20 features (cpg_markers) identified earlier.
library("glmnet")
## glmnet() scales the data internally by default, so we supply un-scaled data:
horvath_mat <- methyl_mat[, cpg_markers] # select the same 20 sites as before
train_mat <- horvath_mat[train_ind, ] # use the same individuals as selected before
test_mat <- horvath_mat[-train_ind, ]
ridge_fit <- glmnet(x = train_mat, y = train_age, alpha = 0)
plot(ridge_fit, xvar = "lambda")
abline(h = 0, lty = "dashed")
Since we split the data into training and test sets, we can check whether ridge regression predicts the test data better than the model with no regularisation. Let’s generate predictions from the ridge regression model and calculate the mean squared error in the test set:
# Obtain a matrix of predictions from the ridge model,
# where each column corresponds to a different lambda value
pred_ridge <- predict(ridge_fit, newx = test_mat)
# Calculate MSE for every column of the prediction matrix against the vector of true ages
err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col))
min_err_ridge <- min(err_ridge)
# Identify the lambda value that results in the lowest MSE (ie, the "best" lambda value)
which_min_err <- which.min(err_ridge)
pred_min_ridge <- pred_ridge[, which_min_err]
## Return errors
min_err_ridge
## [1] 46.76802
This is much lower than the test error for the model without regularisation:
## [1] 223.3571
We can see where on the continuum of lambdas we’ve picked a model by plotting the coefficient paths again. In this case, we’ve picked a model with fairly modest coefficient shrinking.
chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)]
plot(ridge_fit, xvar = "lambda")
abline(v = log(chosen_lambda), lty = "dashed")
Which performs better, ridge or OLS?
Plot predicted ages for each method against the true ages. How do the predictions look for both methods? Why might ridge be performing better?
Ridge regression performs significantly better on unseen data, despite being “worse” on the training data.
The test error for ridge regression:
## [1] 46.76802
and for the unregularised linear model:
## [1] 223.3571
## Put all predictions and true ages on a common scale for the axis limits
all <- c(pred_lm, test_age, pred_min_ridge)
lims <- range(all)
## Plot predicted against true age for both models, side by side
par(mfrow = c(1, 2))
plot(test_age, pred_lm,
    xlim = lims, ylim = lims,
    pch = 19
)
abline(coef = 0:1, lty = "dashed")
plot(test_age, pred_min_ridge,
    xlim = lims, ylim = lims,
    pch = 19
)
abline(coef = 0:1, lty = "dashed")
LASSO regularization uses the L1 norm (sum of absolute coefficient values) to shrink coefficients:
\[ \| \beta \|_1 = \sum_{j=1}^p |\beta_j| \]
This approach encourages sparse models, removing unnecessary features by shrinking some coefficients exactly to zero. The sharp boundaries of the restricted region make it more likely that solutions lie at corners, resulting in simpler models.
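We can see this sparsity in the coefficient paths by fitting a LASSO model across a range of lambda values, just as we did for ridge. A sketch, reusing `train_mat` and `train_age` from the ridge section (the object name `lasso_fit` is our own):

```r
## Fit a LASSO model (alpha = 1) across a range of lambda values
lasso_fit <- glmnet(x = train_mat, y = train_age, alpha = 1)
## Plot the coefficient paths; note coefficients hitting exactly zero
plot(lasso_fit, xvar = "lambda")
abline(h = 0, lty = "dashed")
```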
To balance model complexity and information retention, we choose an optimal λ. A common method is cross-validation, which splits the data into K chunks. K−1 chunks are used for training, and the remaining chunk is for testing. This process rotates through all chunks, providing a reliable estimate of how well each λ value generalizes to new data.
We can use this idea to choose a lambda value: we pick the lambda that minimises the mean error across the cross-validation folds. In R:
# fit lasso model with cross-validation across a range of lambda values
lasso <- cv.glmnet(methyl_mat, age, alpha = 1)
plot(lasso)
# Extract the coefficients from the model with the lowest mean squared error from cross-validation
coefl <- coef(lasso, lasso$lambda.min)
# select only non-zero coefficients
selection <- which(coefl != 0)
# and convert to a normal matrix
selected_coefs <- as.matrix(coefl)[selection, 1]
selected_coefs
## (Intercept) cg02388150 cg06493994 cg22449114 cg22736354
## -8.4133296328 0.6966503392 0.1615535465 6.4255580409 12.0507794749
## cg03330058 cg09809672 cg11299964 cg19761273 cg26162695
## -0.0002362055 -0.7487594618 -2.0399663416 -5.2538055304 -0.4486970332
## cg09502015 cg24771684 cg08446924 cg13215762 cg24549277
## 1.0787003366 4.5743800395 -0.5960137381 0.1481402638 0.6290915767
## cg12304482 cg13131095 cg17962089 cg13842639 cg04080666
## -1.0167896196 2.8860222552 6.3065284096 0.1590465147 2.4889065761
## cg06147194 cg03669936 cg14230040 cg19848924 cg23964682
## -0.6838637838 -0.0352696698 0.1280760909 -0.0006938337 1.3378854603
## cg13578928 cg02745847 cg17410295 cg17459023 cg06223736
## -0.8601170264 2.2346315955 -2.3008028295 0.0370389967 1.6158734083
## cg06717750 cg20138604 cg12851161 cg20972027 cg23878422
## 2.3401693309 0.0084327521 -3.3033355652 0.2442751682 1.1059030593
## cg16612298 cg03762081 cg14428146 cg16908369 cg16271524
## 0.0050053190 -6.5228858163 0.3167227488 0.2302773154 -1.3787104336
## cg22071651 cg04262805 cg24969251 cg11233105 cg03156032
## 0.3480551279 1.1841804186 8.3024629942 0.6130598151 -1.1121959544
We can see that cross-validation has selected a value of λ resulting in 44 features and the intercept.
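As a quick sanity check, we can count the selected features directly from `selected_coefs` (subtracting one to exclude the intercept):

```r
## Number of selected features, excluding the intercept
length(selected_coefs) - 1
```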
Elastic net regression combines the properties of ridge (α = 0) and LASSO (α = 1) regression, blending their advantages. It drops uninformative variables like LASSO while maintaining conservative coefficient estimates like ridge, leading to improved predictions.
Elastic net optimizes the following: \[ \sum_{i=1}^N \left( y_i - x'_i\beta \right)^2 + \lambda \left( \alpha \|\beta\|_1 + (1-\alpha) \|\beta\|_2^2 \right) \]
This flexibility allows elastic net to balance variable selection and prediction accuracy. Contour plots help visualize how the penalty changes for different α values.
Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model object.
Fit an elastic net model with cross-validation and plot the error. Compare with LASSO.
Fitting an elastic net model is just like fitting a LASSO model. You can see that coefficients still tend to go exactly to zero, but the paths are a bit less extreme than with pure LASSO, more similar to ridge.
The process of model selection is similar for elastic net models as for LASSO models.
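A sketch of one possible solution, following the same pattern as the ridge and LASSO fits above (the object names `elastic` and `elastic_cv` are our own):

```r
## Fit an elastic net model, blending the two penalties equally
elastic <- glmnet(methyl_mat, age, alpha = 0.5)
plot(elastic, xvar = "lambda")

## Fit with cross-validation to choose lambda, exactly as for LASSO
elastic_cv <- cv.glmnet(methyl_mat, age, alpha = 0.5)
plot(elastic_cv)

## Extract coefficients at the lambda with the lowest cross-validated error
coef(elastic_cv, elastic_cv$lambda.min)
```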